Make language names unique #149
This seems good to me. "Baka (South Sudan/Congo)" should probably use "DRC" or "Congo-Kinshasa", since "Congo" is ambiguous. Should multiple countries be listed in a specific order?
    names = Counter([lang.name for lang in LANGUAGES.values()])
    if any(count > 1 for count in names.values()):
        duplicates = {name: count for name, count in names.items() if count > 1}
        pytest.fail(f"Duplicate language names: {duplicates}")
Nit: much nicer to read sorted
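One way to act on this nit, as a minimal sketch only: sort the counted names before building the duplicates dict, so the failure message lists them alphabetically. The helper name `find_duplicates` and the sample list are hypothetical and stand in for the real `LANGUAGES` data.

```python
from collections import Counter


def find_duplicates(names):
    """Return {name: count} for names that occur more than once, sorted by name."""
    counts = Counter(names)
    # sorted() orders the (name, count) pairs by name; dicts keep insertion order.
    return {name: count for name, count in sorted(counts.items()) if count > 1}


# Hypothetical sample data, not the real LANGUAGES table.
sample = ["Serbian", "Baka", "Maltese", "Serbian", "Baka", "Baka"]
duplicates = find_duplicates(sample)
print(f"Duplicate language names: {duplicates}")
```

In the test itself, the same dict could be passed to `pytest.fail` whenever it is non-empty.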
tests/test_data_languages.py
def test_language_uniqueness():
    names = Counter([lang.name for lang in LANGUAGES.values()])
When present, I think you want to use lang.preferredName instead of lang.name.
That said, you might consider using that preferredName field in some of the language proto files in this PR. I don't actually see specific examples where that would make more sense, but it's worth considering.
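A rough sketch of that suggestion, under stated assumptions: the `Lang` dataclass below is a hypothetical stand-in for the language proto message, the attribute spelling `preferred_name` is assumed rather than taken from the schema, and the language IDs and names are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lang:
    """Hypothetical stand-in for the language proto message."""
    name: str
    preferred_name: Optional[str] = None  # assumed field spelling


def display_name(lang):
    """Use the preferred name when it is set, otherwise fall back to name."""
    return lang.preferred_name or lang.name


# Invented sample data: two entries collide only via the preferred name.
LANGUAGES = {
    "aa_Latn": Lang(name="Foo"),
    "bb_Latn": Lang(name="Bar", preferred_name="Foo"),
    "cc_Latn": Lang(name="Baz"),
}

names = Counter(display_name(lang) for lang in LANGUAGES.values())
duplicates = {name: count for name, count in sorted(names.items()) if count > 1}
print(f"Duplicate language names: {duplicates}")
```

Comparing on the displayed name can surface collisions that a check on `name` alone would miss, since two entries with distinct `name` values may share a preferred name.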
Oh boy, that found extra duplicates. :-)
This improves CI testing for textproto parseability and for the uniqueness of language names. I've noticed that we have a system for disambiguating language names: "<Region>ian <Language>" where there are regional variants, and "<Language>, <Script>" where the language is written in multiple scripts; so I've tried to extend this to the cases which are currently doubling up. Additionally, for Baka there are two different, unrelated languages with the same name, so I have disambiguated them with "Baka (<Region>)".
I've also removed mlt_Latn (Maltese) because we have a better mt_Latn. @moyogo, does that sound fair?